Abstract

We consider the ubiquitous technique of VByte compression, which represents each integer as a variable-length sequence of bytes. The low 7 bits of each byte encode a portion of the integer, and the high bit of each byte is reserved as a continuation flag. This flag is set to 1 for all bytes except the last, and the decoding of each integer is complete when a byte with a high bit of 0 is encountered. VByte decoding can be a performance bottleneck, especially when the unpredictable lengths of the encoded integers cause frequent branch mispredictions. Previous attempts to accelerate VByte decoding using SIMD vector instructions have been disappointing, prodding search engines such as Google to use more complicated but faster-to-decode formats for performance-critical code. Our decoder (Masked VByte) is 2 to 4 times faster than a conventional scalar VByte decoder, making the format once again competitive with regard to speed.

I. Introduction 

In many applications, sequences of integers are compressed with VByte to reduce memory usage. For example, it is part of the search engine Apache Lucene (under the name vInt). It is used by Google in its Protocol Buffers interchange format (under the name Varint) and it is part of the default API in the Go programming language. It is also used in databases such as IBM DB2 (under the name Variable Byte) [1].

We can describe the format as follows. Given a non-negative integer in binary format, and starting from the least significant bits, we write it out using seven bits in each byte, with the most significant bit of each byte set to 0 (for the last byte), or to 1 (in the preceding bytes). In this manner, integers in $[0, 2^7)$ are coded using a single byte, integers in $[2^7, 2^{14})$ use two bytes and so on. See Table 1 for examples.

The VByte format is applicable to arbitrary integers, including 32-bit and 64-bit integers. However, we focus on 32-bit integers for simplicity.
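The format described above can be sketched in a few lines of Python. This is an illustration only, not the authors' implementation; the function name `vbyte_encode` is an assumption made for this example.

```python
def vbyte_encode(value: int) -> bytes:
    """Encode one non-negative integer in the VByte format."""
    if value < 0:
        raise ValueError("VByte encodes non-negative integers only")
    out = bytearray()
    while value >= 128:
        # Emit the low 7 bits with the continuation flag (high bit) set to 1.
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    # Last byte: the high bit stays at 0.
    out.append(value)
    return bytes(out)


# Print the VByte form of a few powers of two, as in Table 1.
for n in (1, 128, 16384):
    print(n, [format(b, "08b") for b in vbyte_encode(n)])
```

Any 32-bit integer needs at most five bytes under this scheme, since five groups of 7 bits cover 35 bits.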

Differential coding. A common application in information retrieval is to compress the list of document identifiers in an inverted index [2]. In such a case, we would not code directly the identifiers $(x_1, x_2, \ldots)$, but rather their successive differences (e.g., $x_1 - 0, x_2 - x_1, \ldots$), sometimes called deltas or gaps. If the document identifiers are provided in sorted order, then we might expect the gaps to be small and thus compressible using VByte. We refer to this approach as differential coding. There are several possible approaches to differential coding. For example, if there are no repeated values, we can subtract one from each difference ($x_1 - 0, x_2 - x_1 - 1, x_3 - x_2 - 1, \ldots$) or we can subtract blocks of integers for greater speed ($x_1, x_2, x_3, x_4, x_5 - x_1, x_6 - x_2, x_7 - x_3, x_8 - x_4, \ldots$).

For simplicity, we only consider gaps defined as successive differences ($x_2 - x_1, \ldots$). In this instance, we need to compute a prefix sum over the gaps to recover the original values (i.e., $x_i = (x_i - x_{i-1}) + x_{i-1}$).
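As a small illustration of gaps defined as successive differences, the sketch below converts a sorted list to gaps and back. The names `to_gaps` and `from_gaps` and the sample identifiers are hypothetical.

```python
from itertools import accumulate


def to_gaps(sorted_ids):
    """Successive differences x1 - 0, x2 - x1, ... of a sorted list."""
    previous = 0
    gaps = []
    for x in sorted_ids:
        gaps.append(x - previous)
        previous = x
    return gaps


def from_gaps(gaps):
    """A prefix sum over the gaps recovers the original values."""
    return list(accumulate(gaps))


ids = [3, 10, 11, 150, 1000]  # hypothetical document identifiers
gaps = to_gaps(ids)           # [3, 7, 1, 139, 850]
restored = from_gaps(gaps)    # equal to ids
```

In this example every gap but one is below 128 and would fit in a single VByte byte, whereas the last two identifiers themselves would each need two bytes.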

II. Efficient VByte Decoding 

One of the benefits of the VByte format is that we can write an efficient decoder using just a few lines of code in almost any programming language. A typical decoder applies Algorithm 1. In this algorithm, the function readByte provides byte values in $[0, 2^8)$ representing a number $x$ in the VByte format.

Processing each input byte requires only a few inexpensive operations (e.g., two additions, one shift, one mask). However, each byte also involves a branch. On a recent Intel processor (e.g., one using the Haswell microarchitecture), a single mispredicted branch can incur a cost of 15 cycles or more. When all integers are compressed down to one byte, mispredictions are rare and the performance is high. However, when one-byte and two-byte values occur in close proximity, the branch may become less predictable and performance may suffer.

For differential coding, we modify this algorithm so that it decodes the gaps and computes the prefix sum. It suffices to keep track of the last value decoded and add it to the decoded gap.
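A conventional byte-at-a-time decoder, with optional differential coding, might look like the following sketch. It uses a compact loop rather than the unrolled form of Algorithm 1, and the names `vbyte_decode` and `differential` are assumptions made for this example. Python integers are unbounded, so no 32-bit wraparound is modeled.

```python
def vbyte_decode(data: bytes, differential: bool = False):
    """Decode a VByte stream one byte at a time."""
    out = []
    value = 0  # integer being assembled
    shift = 0  # bit position of the next 7-bit group
    last = 0   # last decoded value, used for the prefix sum
    for b in data:
        value |= (b & 0x7F) << shift
        if b < 128:
            # High bit is 0: this byte completes the integer.
            if differential:
                value += last
                last = value
            out.append(value)
            value, shift = 0, 0
        else:
            shift += 7
    if shift != 0:
        raise ValueError("truncated VByte stream")
    return out
```

The `if b < 128` test is the data-dependent branch discussed above: its outcome follows the lengths of the encoded integers, which is what makes it hard to predict on mixed inputs.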

Table 1: VByte form for various powers of two. Within each word, the most significant bits are presented first. In the VByte form, the most significant bit of each byte is the continuation flag.

integer   binary form (16 bits)   VByte form
-------   ---------------------   ----------------------------
1         0000000000000001        00000001
2         0000000000000010        00000010
4         0000000000000100        00000100
128       0000000010000000        10000000, 00000001
256       0000000100000000        10000000, 00000010
512       0000001000000000        10000000, 00000100
16384     0100000000000000        10000000, 10000000, 00000001
32768     1000000000000000        10000000, 10000000, 00000010

Algorithm 1 Conventional VByte decoder. The continue instruction returns the execution to the main loop. The readByte function returns the next available input byte.

y ← empty array of 32-bit integers
while input bytes are available do
    b ← readByte()
    if b < 128 then append b to y and continue
    c ← b mod $2^7$
    b ← readByte()
    if b < 128 then append c + b × $2^7$ to y and continue
    c ← c + (b mod $2^7$) × $2^7$
    b ← readByte()
    if b < 128 then append c + b × $2^{14}$ to y and continue
    c ← c + (b mod $2^7$) × $2^{14}$
    b ← readByte()
    if b < 128 then append c + b × $2^{21}$ to y and continue
    c ← c + (b mod $2^7$) × $2^{21}$
    b ← readByte()
    append c + b × $2^{28}$ to y
return y

III. SIMD Instructions

Intel processors provide SIMD instructions operating on 128-bit registers (called XMM registers). These registers can be considered as vectors of two 64-bit integers, vectors of four 32-bit integers, vectors of eight 16-bit integers or vectors of sixteen 8-bit integers.

We review the main SIMD instructions we require in Table 2. We can roughly judge the computational cost of an instruction by its latency and reciprocal throughput. The latency is the minimum number of cycles required to execute the instruction. The latency is most important when subsequent operations have to wait for the instruction to complete. The reciprocal throughput is the inverse of the maximum number of instructions that can be executed per cycle. For example, a reciprocal throughput of 0.5 means that up to two instructions can be executed per cycle.

We use the movdqu instruction to load or store a register. Loading and storing registers has a relatively high latency (3 cycles). While we can load two registers per cycle, we can only store one of them to memory. A typical SIMD instruction is paddd: it adds two vectors of four 32-bit integers at once.

Sometimes it is necessary to selectively copy the content from one XMM register to another while possibly copying and duplicating components to other locations. We can do so with the pshufd instruction when considering the registers as vectors of 32-bit integers, or with the pshufb instruction when the registers are considered as vectors of bytes. These instructions take an input register $v$ as well as a control mask $m$ and they output a new vector $(v_{m_0}, v_{m_1}, v_{m_2}, v_{m_3}, \ldots)$ with the added convention that $v_{-1} = 0$. Thus, for example, the pshufd instruction can copy one particular value to all positions (using a mask made of 4 identical values). If we wish to shift by a number of bytes, it can be more efficient to use a dedicated instruction (psrldq or pslldq) even though the pshufb instruction could achieve the same result. Similarly, we can use the pmovsxbd instruction to more efficiently unpack the first four bytes as four 32-bit integers.
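The shuffle semantics can be emulated in scalar Python to make the convention $v_{-1} = 0$ concrete. This sketch models only the behavior described in the text; the function reuses the instruction name `pshufb` for readability, and a control value of -1 stands for "write a zero byte".

```python
def pshufb(v: bytes, m: list) -> bytes:
    """Emulated byte shuffle: output byte k is v[m[k]], or 0 when m[k] is -1."""
    assert len(v) == 16 and len(m) == 16
    return bytes(0 if idx < 0 else v[idx & 0x0F] for idx in m)


v = bytes(range(16))

# Copy one particular value to all positions (a mask of identical values).
broadcast = pshufb(v, [3] * 16)

# A shift right by 4 bytes expressed as a shuffle: same result as psrldq by 4.
shifted = pshufb(v, list(range(4, 16)) + [-1] * 4)
```

The second call shows why a dedicated byte shift and a shuffle are interchangeable in terms of the result, even if their costs differ.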

We can simultaneously shift right by a given number of bits all of the components of a vector using the instructions psrlw (16-bit integers), psrld (32-bit integers) and psrlq (64-bit integers). There are also corresponding left-shift instructions such as psllq. We can also compute the bitwise OR and bitwise AND between two 128-bit registers using the por and pand instructions.

There is no instruction to shift a vector of 16-bit integers by different numbers of bits (e.g., $(v_1, v_2, \ldots) \to (v_1 \gg 1, v_2 \gg 2, \ldots)$) but we can get the equivalent result by multiplying integers (e.g., with the pmullw instruction). The AVX2 instruction set introduced such flexible shift instructions (e.g., vpsrlvd), and they are much faster than a multiplication, but they are not applicable to vectors of 16-bit integers. Intel proposed a new instruction set (AVX-512) which contains such an instruction (vpsrlvw) but it is not yet publicly available.

Our contribution depends crucially on the pmovmskb instruction. Given a vector of sixteen bytes, it outputs a 16-bit value made of the most significant bit of each of the sixteen input bytes: e.g., given the vector $(128, 128, \ldots, 128)$, pmovmskb would output 0xFFFF.
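A scalar emulation shows what this instruction computes. In the sketch below, bit $k$ of the result is the most significant bit of byte $k$, so the first byte maps to the least significant bit; the function reuses the instruction name for readability.

```python
def pmovmskb(v: bytes) -> int:
    """Emulated pmovmskb: gather the high bit of each of 16 bytes into a 16-bit mask."""
    assert len(v) == 16
    mask = 0
    for k, byte in enumerate(v):
        mask |= (byte >> 7) << k  # bit k is the high bit of byte k
    return mask


all_flags = pmovmskb(bytes([128] * 16))  # 0xFFFF

# Six VByte bytes encoding 128, 386, 16, 32, padded with zeros to 16 bytes.
data = bytes([0x80, 0x01, 0x82, 0x03, 0x10, 0x20]) + bytes(10)
flags = pmovmskb(data)                   # 0b000101: bytes 0 and 2 are continuation bytes
```

Applied to VByte data, the mask is exactly the sequence of continuation flags, which is all a decoder needs to know where each integer begins and ends.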

IV. Masked VByte decoding 

The conventional VByte decoders algorithmically process one input byte at a time (see Algorithm 1). To multiply the decoding speed, we want to process larger chunks of input data at once. Thankfully, commodity Intel and AMD processors have supported Single instruction, multiple data (SIMD) instructions since the introduction of the Pentium 4 in 2001. These instructions can process several words at once, enabling vectorized algorithms.

Stepanov et al. [3] used SIMD instructions to accelerate the decoding of VByte data (which they call varint-SU). According to their experimental results, SIMD instructions lead to a disappointing speed improvement of less than 25 %, with no gain at all in some instances. To get higher speeds (e.g., an increase of 3×), they proposed instead new formats akin to Google's Group Varint [4]. For simplicity, we do not consider such "Group" alternatives further: once we consider different data formats, a wide range of fast SIMD-based compression schemes become available [5], some of them faster than Stepanov et al.'s fastest proposal.

1st International Symposium on Web AlGorithms • June 2015

Table 2: Relevant SIMD instructions on Haswell Intel processors with latencies and reciprocal throughput in CPU cycles.

instruction   description                                                    latency   rec. throughput
-----------   ------------------------------------------------------------   -------   ---------------
movdqu        store or retrieve a 128-bit register                           3         1/0.5
paddd         add four pairs of 32-bit integers                              1         0.5
pshufd        shuffle four 32-bit integers                                   1         1
pshufb        shuffle sixteen bytes                                          1         1
psrldq        shift right by a number of bytes                               1         0.5
pslldq        shift left by a number of bytes                                1         0.5
pmovsxbd      unpack the first four bytes into four 32-bit ints.             1         0.5
pmovsxwd      unpack the first four 16-bit integers into four 32-bit ints.   1         0.5
psrlw         shift right eight 16-bit integers                              1         1
psrld         shift right four 32-bit integers                               1         1
psrlq         shift right two 64-bit integers                                1         1
psllq         shift left two 64-bit integers                                 1         1
por           bitwise OR between two 128-bit registers                       1         0.33
pand          bitwise AND between two 128-bit registers                      1         0.33
pmullw        multiply eight 16-bit integers                                 5         1
pmovmskb      create a 16-bit mask from the most significant bits            3         1

Though they did not provide a detailed description, Stepanov et al.'s approach resembles ours in spirit. Consider the simplified example from Fig. 1. It illustrates the main steps:

• From the input bytes, we gather the control bits (1,0,1,0,0,0 in this case) using the pmovmskb instruction.

• From the resulting mask, we look up a control mask in a table and apply the pshufb instruction to move the bytes. In our example, the first 5 bytes are left in place (at positions 1, 2, 3, 4, 5) whereas the 6th byte is moved to position 7. Other output bytes are set to zero.

• We can then extract the first 7 bits of the low bytes (at positions 1, 3, 5, 7) into a new 8-byte register. We can also extract the high bytes (positions 2, 4, 6, 8) into another 8-byte register. On this second register, we apply a right shift by 1 bit on the four 16-bit values (using psrlw). Finally, we compute the bitwise OR of these two registers, combining the results from the low and high bits.
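The three steps above can be traced in scalar Python on the six bytes of Fig. 1. This is a simplified sketch that assumes every integer fits in at most two bytes. The helper names are hypothetical, and the permutation is computed on the fly here, whereas the text looks it up in a table.

```python
import struct


def control_bits(chunk: bytes):
    """Step 1 (pmovmskb): the high bit of each input byte."""
    return [b >> 7 for b in chunk]


def shuffle_for(bits):
    """Step 2: byte permutation placing each integer in a (low, high) byte pair.

    An index of -1 means "write a zero byte".
    """
    shuffle, pos = [], 0
    while pos < len(bits):
        if bits[pos]:  # two-byte integer: low byte, then high byte
            shuffle += [pos, pos + 1]
            pos += 2
        else:          # one-byte integer: low byte, zero high byte
            shuffle += [pos, -1]
            pos += 1
    return shuffle


def decode_simplified(chunk: bytes):
    bits = control_bits(chunk)
    permuted = bytes(0 if k < 0 else chunk[k] for k in shuffle_for(bits))  # pshufb
    words = struct.unpack("<%dH" % (len(permuted) // 2), permuted)  # 16-bit view
    low = [w & 0x007F for w in words]   # pand: low 7 bits of the low bytes
    high = [w & 0xFF00 for w in words]  # pand: high bytes only
    high = [w >> 1 for w in high]       # psrlw: right shift by 1 bit
    return [l | h for l, h in zip(low, high)]  # por


data = bytes([0b10000000, 0b00000001, 0b10000010,
              0b00000011, 0b00010000, 0b00100000])
values = decode_simplified(data)  # [128, 386, 16, 32]
```

The right shift by one bit closes the gap left by the continuation flag: a high byte sits at bit 8 of its 16-bit word but must contribute at bit 7 of the decoded integer.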

A naive implementation of this idea could be slow. Indeed, we face several performance challenges:

• The pmovmskb instruction has a relatively high latency (e.g., 3 cycles on the Haswell microarchitecture).

• The pmovmskb instruction processes 16 bytes at once, generating a 16-bit result. Yet looking up a 16-bit value in a table would require a 65536-value table. Such a large table is likely to stress the CPU cache. Moreover, we do not know ahead of time where coded integers begin and end: a typical segment of 16 bytes might contain the end of one compressed integer, a few compressed integers, and the beginning of another compressed integer.


Our proposed algorithm works on 12-byte inputs and 12-bit masks. In practice, we load the input bytes in a 128-bit register containing 16 bytes (using movdqu), but only the first 12 bytes are considered. For the time being, let us assume that the segment begins with a complete encoded integer. Moreover, assume that the 12-bit mask has been precomputed.

In what follows, we use the convention that $(\cdots)_k$ is a vector of $k$-bit integers. Because numbers are stored in binary notation, we have that

$(1, 0, 0, 0)_8 = (1, 0)_{16} = (1)_{32}$,

that is, all three vectors represent the same binary data.
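This identity can be checked with Python's `struct` module by viewing the same four bytes at three widths. The sketch assumes little-endian byte order, as on the Intel processors discussed here.

```python
import struct

raw = bytes([1, 0, 0, 0])

as_bytes = struct.unpack("<4B", raw)  # (1, 0, 0, 0) as 8-bit integers
as_words = struct.unpack("<2H", raw)  # (1, 0) as 16-bit integers
as_dword = struct.unpack("<1I", raw)  # (1,) as a 32-bit integer

# A nonzero second byte lands in the high half of the first 16-bit word.
shifted = struct.unpack("<2H", bytes([0, 1, 0, 0]))  # (256, 0)
```

The same reinterpretation is what lets a byte shuffle prepare data that is then shifted or multiplied as 16-bit or 32-bit lanes.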

• If the mask is $00 \cdots 00$, then the 12 input bytes represent 12 integers as is. We can unpack the first 4 bytes to four 32-bit integers (in a 128-bit register) with the pmovsxbd instruction. This new register can then be stored in the output buffer. We can then shift the input register by 4 bytes using the psrldq instruction, and apply the pmovsxbd instruction again. Repeating a third time, we have decoded all 12 integers. We have consumed 12 input bytes and written 12 integers.

• Otherwise we use the 12-bit mask to look up two 8-bit values in a table with $2^{12}$ entries. The first 8-bit value is an integer between 2 and 12 indicating how many input bytes we consume. Though it is not immediately useful to know how many bytes are consumed, we use this number of consumed bytes when loading the next input bytes. The second 8-bit value is an index $i$ taking integer values in $[0, 170)$. From this index, we load up one of 170 control masks. We then proceed according to the value of the index $i$:
Compressed data (6 bytes): 10000000, 00000001, 10000010, 00000011, 00010000, 00100000
Mask (6 bits), extracted with pmovmskb: 1, 0, 1, 0, 0, 0
Permuted bytes (pshufb, using the mask and a look-up table): 10000000, 00000001, 10000010, 00000011, 00010000, 0, 00100000, 0
Decoded values, obtained from the permuted bytes by extracting the 4 low and 4 high bytes with masks, shifting down by 1 bit the high bytes and ORing (low byte, high byte): (10000000, 0), (10000010, 00000001), (00010000, 0), (00100000, 0)
Integers: 128, 386, 16, 32

Figure 1: Simplified illustration of vectorized VByte decoding from 6 bytes to four 16-bit integers (128, 386, 16, 32).


- If $i < 64$, then the next 6 integers each fit in at most two bytes (they are less than $2^{14}$). There are exactly $2^6 = 64$ cases corresponding to this scenario. That is, the first integer can fit in one or two bytes, the second integer in one or two bytes, and so on, generating 64 distinct cases. For each of the 6 integers $x_i$, we have the low byte containing the least significant 7 bits of the integer $a_i$, and optionally a high byte containing the next 7 bits $b_i$ ($x_i = a_i + b_i 2^7$). The call to pshufb will permute the bytes such that the low bytes occupy the positions 1, 3, 5, 7, 9, 11 whereas the high bytes, when available, occupy the positions 2, 4, 6, 8, 10, 12. When a high byte is not available, the byte value zero is used instead.

  For example, when all 6 values are in $[2^7, 2^{14})$, the permuted bytes are

  $(a_1!, b_1, a_2!, b_2, a_3!, b_3, \ldots, a_6!, b_6)_8$

  when presented as a vector of bytes with the short-hand notation $a_i! = a_i + 2^7$.

  From these permuted bytes, we generate two vectors using bitwise ANDs with fixed masks (using pand). The first one retains only the least significant 7 bits of the low bytes: as a vector of 16-bit integers we have

  $(a_1!, b_1, a_2!, b_2, a_3!, b_3, \ldots, a_6!, b_6)_8$ AND $(127, 0, 127, 0, 127, 0, \ldots, 127, 0)_8 = (a_1, a_2, a_3, \ldots, a_6)_{16}$.

  The second one retains only the high bytes:

  $(0, b_1, 0, b_2, 0, b_3, \ldots, 0, b_6)_8$.

  Considering the latter as a vector of 16-bit integers, we right shift it by 1 bit (using psrlw) to get the following vector

  $(b_1 2^7, b_2 2^7, b_3 2^7, \ldots, b_6 2^7)_{16}$.

  We can then combine (with a bitwise OR using por) this last vector with the vector containing the least significant 7 bits of the low bytes. We have effectively decoded the 6 integers as 16-bit integers: we get

  $(a_1 + b_1 2^7, a_2 + b_2 2^7, a_3 + b_3 2^7, a_4 + b_4 2^7, a_5 + b_5 2^7, a_6 + b_6 2^7)_{16}$.

  We can unpack the first four to 32-bit integers using an instruction such as pmovsxwd; we can then shift by 8 bytes (using psrldq) and apply pmovsxwd once more to decode the last two integers.

- If $64 \le i < 145$, the next 4 encoded integers fit in at most 3 bytes. We can check that there are $81 = 3^4$ such cases. The processing is then similar to the previous case except that we have up to three bytes per integer (low, middle and high). The permuted version will re-arrange the input bytes so that the first 3 bytes contain the low, middle and high bytes of the first integer, with the convention that a zero byte is written when there is no corresponding input byte. The next byte always contains a zero. Then we store the data corresponding to the next integer in the next 3 bytes. A zero byte is added. And so on. This time, we create 3 new vectors using bitwise ANDs with appropriate masks: one retaining only the least significant 7 bits from the low bytes, another retaining only the least significant 7 bits from the middle bytes and another retaining only the high bytes. As vectors of 32-bit integers, the second vector is right shifted by 1 bit whereas the third vector is right shifted by 2 bits (using psrld). The 3 registers are then combined with a bitwise OR and written to the output buffer.

- Finally, when $145 \le i < 170$, we decode the next 2 integers. Each of these integers can consume from 1 to 5 input bytes. There are $5^2 = 25$ such cases.

  For simplicity of exposition, we only explain how we decode the first of the two integers using 8-byte buffers. The integer can be written as $x_1 = a_1 + b_1 2^7 + c_1 2^{14} + d_1 2^{21} + e_1 2^{28}$ where $a_1, b_1, c_1, d_1 \in [0, 2^7)$ and $e_1 \in [0, 2^4)$. Assuming that $x_1 \ge 2^{28}$, then the first 5 input bytes will be $(a_1!, b_1!, c_1!, d_1!, e_1)_8$.

  Irrespective of the value of the index $i$, the first step is to set the most significant bit of each input byte to 0 with a bitwise AND. Thus, if $x_1 \ge 2^{28}$, we get $(a_1, b_1, c_1, d_1, e_1)_8$.

  We then permute the bytes so that we get the following 8 bytes:

  $Y = (b_1, c_1, d_1, e_1 + a_1 2^8)_{16}$.

  The last byte is occupied by the value $a_1$ which we can isolate for later use by shifting right the whole vector by seven bytes (using psrldq):

  $Y' = (a_1, 0, 0, 0, 0, 0, 0, 0)_8$.

  Using the pmullw instruction, we multiply the permuted bytes ($Y$) by the vector

  $(2^7, 2^6, 2^5, 2^4)_{16}$

  to get

  $(b_1 2^7, c_1 2^6, d_1 2^5, e_1 2^4 + (a_1 2^{12} \bmod 2^{16}))_{16}$.

  As a byte vector, this last vector is equivalent to

  $X = (b_1 2^7 \bmod 2^8, b_1 \div 2, c_1 2^6 \bmod 2^8, c_1 \div 2^2, d_1 2^5 \bmod 2^8, d_1 \div 2^3, e_1 2^4 \bmod 2^8, *)_8$

  where we used $*$ to indicate an irrelevant byte value. We can left shift this last result by one byte (using psllq):

  $X' = (0, b_1 2^7 \bmod 2^8, b_1 \div 2, c_1 2^6 \bmod 2^8, c_1 \div 2^2, d_1 2^5 \bmod 2^8, d_1 \div 2^3, e_1 2^4 \bmod 2^8)_8$.

  We can combine these two results with the value $a_1$ isolated earlier ($Y'$):

  $Y'$ OR $X$ OR $X' = (a_1 + b_1 2^7 \bmod 2^8, *, b_1 \div 2 + c_1 2^6 \bmod 2^8, *, c_1 \div 2^2 + d_1 2^5 \bmod 2^8, *, d_1 \div 2^3 + e_1 2^4 \bmod 2^8, *)_8$

  where again we use $*$ to indicate irrelevant byte values. We can permute this last vector to get

  $(a_1 + b_1 2^7 \bmod 2^8, b_1 \div 2 + c_1 2^6 \bmod 2^8, c_1 \div 2^2 + d_1 2^5 \bmod 2^8, d_1 \div 2^3 + e_1 2^4 \bmod 2^8, \ldots)_8 = (a_1 + b_1 2^7 + c_1 2^{14} + d_1 2^{21} + e_1 2^{28}, \ldots)_{32}$.

  Thus, we have effectively decoded the integer $x_1$.

  The actual routine works with two integers $(x_1, x_2)$. The content of the first one is initially stored in the first 8 bytes of a 16-byte vector whereas the remaining 8 bytes are used for the second integer. Both integers are decoded simultaneously.
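The multiplication-based steps for a five-byte integer can be traced byte by byte in scalar Python. The sketch below works on an 8-byte buffer, as in the exposition, and emulates each instruction with plain byte operations. The function name `decode_five_bytes` is hypothetical.

```python
import struct


def decode_five_bytes(chunk: bytes) -> int:
    """Decode one 5-byte VByte integer following the steps described above."""
    assert len(chunk) == 5
    a, b, c, d, e = (x & 0x7F for x in chunk)  # pand: clear the high bits
    # pshufb: 16-bit lanes (b, c, d, e + a * 2^8), i.e. bytes b,0,c,0,d,0,e,a.
    y = bytes([b, 0, c, 0, d, 0, e, a])
    # psrldq by 7 bytes: isolate a in the first byte.
    y_prime = y[7:] + bytes(7)
    # pmullw by (2^7, 2^6, 2^5, 2^4): per-lane left shifts, truncated to 16 bits.
    lanes = struct.unpack("<4H", y)
    mult = (1 << 7, 1 << 6, 1 << 5, 1 << 4)
    x = struct.pack("<4H", *((l * m) & 0xFFFF for l, m in zip(lanes, mult)))
    # psllq by 8 bits: every byte moves up one position.
    x_prime = bytes(1) + x[:7]
    # por: combine the three vectors.
    combined = bytes(p | q | r for p, q, r in zip(y_prime, x, x_prime))
    # pshufb: keep bytes 0, 2, 4, 6 and read them as one 32-bit integer.
    packed = bytes(combined[k] for k in (0, 2, 4, 6))
    return struct.unpack("<I", packed)[0]


largest = decode_five_bytes(bytes([0xFF, 0xFF, 0xFF, 0xFF, 0x0F]))  # 2**32 - 1
```

The multiplication stands in for four different shift amounts at once, which is why a single pmullw can replace the variable shift that the instruction set lacks for 16-bit lanes.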


An important motivation is to amortize the latency of the pmovmskb instruction as much as possible. First, we repeatedly call the pmovmskb instruction until we have processed up to 48 bytes to compute a corresponding 48-bit mask. Then we repeatedly call the decoding procedure as long as at least 12 mask bits remain out of the processed 48 bytes. After each call to the 12-byte decoding procedure, we left shift the mask by the number of consumed bits. Recall that we look up the number of consumed bytes at the beginning of the decoding procedure so this number is readily available and its determination does not cause any delay. When fewer than 12 valid bits remain in the mask, we process another block of 48 input bytes with pmovmskb. To further accelerate this process, and if there are enough input bytes, we maintain two 48-bit masks (representing 96 input bytes): in this manner, a 48-bit mask is already available while a new one is being computed. When fewer than 48 input bytes but more than 16 input bytes remain, we call pmovmskb as needed to ensure that we have at least a 12-bit mask. When it is no longer possible, we fall back on conventional VByte decoding.
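To make the mask bookkeeping concrete, the following sketch derives, from a 12-bit mask, the number of bytes consumed and the byte lengths of the decoded integers, then walks a wider mask 12 bits at a time. It is an assumption-laden illustration: the names `leading_lengths`, `classify` and `consumed_per_call` are hypothetical, the lengths are computed directly instead of being read from a precomputed table, and bit 0 is taken to be the flag of the first byte, so that dropping consumed flags is a right shift in this bit order.

```python
def leading_lengths(mask: int, count: int, max_len: int, nbits: int = 12):
    """Byte lengths of the first `count` integers described by `mask`.

    Returns None if one of them exceeds `max_len` bytes or is incomplete.
    """
    lengths, pos = [], 0
    for _ in range(count):
        length = 1
        while pos < nbits and (mask >> pos) & 1:  # continuation flag set
            length += 1
            pos += 1
        if pos >= nbits or length > max_len:
            return None
        pos += 1  # step over the final byte of this integer
        lengths.append(length)
    return lengths


def classify(mask: int):
    """Return (bytes consumed, lengths), trying the three cases in order."""
    for count, max_len in ((6, 2), (4, 3), (2, 5)):
        lengths = leading_lengths(mask, count, max_len)
        if lengths is not None:
            return sum(lengths), tuple(lengths)
    raise ValueError("integer longer than 5 bytes")


def consumed_per_call(mask: int, nbits: int):
    """Walk a wide mask (e.g. 48 bits) while at least 12 bits remain."""
    steps = []
    while nbits >= 12:
        window = mask & 0xFFF
        consumed = 12 if window == 0 else classify(window)[0]
        steps.append(consumed)
        mask >>= consumed  # drop the flags of the consumed bytes
        nbits -= consumed
    return steps
```

Enumerating all 12-bit masks with `classify` yields 64 + 81 + 25 = 170 distinct length patterns and consumed counts between 2 and 12, consistent with the table described earlier.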

Differential coding. We described the decoding procedure without accounting for differential coding. It can be added without any major algorithmic change. We just keep track of the last 32-bit integer decoded. We might store it in the last entry of a vector of four 32-bit integers (henceforth $p = (p_1, p_2, p_3, p_4)$).

We compute the prefix sum before writing the decoded integers. There are two cases to consider. We either write four 32-bit integers or two 32-bit integers (e.g., when writing 6 decoded integers, we first write 4, then 2 integers). In both cases, we first permute the entries of $p$ so that $p \leftarrow (p_4, p_4, p_4, p_4)$ using the pshufd instruction.


• Suppose we decoded four integers in the vector $c$. We left shift the content of $c$ by one integer (using pslldq) so that $c' \leftarrow (0, c_1, c_2, c_3)$. We add $c$ to $c'$ using paddd so that $c \leftarrow (c_1, c_1 + c_2, c_2 + c_3, c_3 + c_4)$. We shift the result by two integers (using again pslldq) $c' \leftarrow (0, 0, c_1, c_1 + c_2)$ and add $c + c' = (c_1, c_1 + c_2, c_1 + c_2 + c_3, c_1 + c_2 + c_3 + c_4)$. Finally, we add $p$ to this last result: $p \leftarrow p + c + c' = (p_4 + c_1, p_4 + c_1 + c_2, p_4 + c_1 + c_2 + c_3, p_4 + c_1 + c_2 + c_3 + c_4)$. We can write $p$ as the decoded output.

• The process is similar though less efficient if we only have two decoded gaps. We start from a vector containing two gaps $c \leftarrow (c_1, c_2, *, *)$ where we indicate irrelevant entries with $*$. We can left shift by one integer $c' \leftarrow (0, c_1, c_2, *)$ and add the result $c + c' = (c_1, c_1 + c_2, *, *)$. Using the pshufd instruction, we can copy the value of the second component to the third and fourth components, generating $(c_1, c_1 + c_2, c_1 + c_2, c_1 + c_2)$. We can then add $p$ to this result and store the result back into $p$. The first two integers can be written out as output.
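Both prefix-sum variants can be emulated on Python lists standing in for vectors of four 32-bit lanes. The helper names below are hypothetical, and each helper models only the lane-level effect of the instruction named in its comment.

```python
MASK32 = 0xFFFFFFFF


def shift_left_lanes(v, n):
    """pslldq by 4*n bytes: lanes move toward higher positions, zeros enter."""
    return [0] * n + v[:4 - n]


def add_lanes(u, v):
    """paddd: lane-wise 32-bit addition with wraparound."""
    return [(x + y) & MASK32 for x, y in zip(u, v)]


def prefix_sum4(gaps, p):
    """Four gaps in, four decoded values out; p holds the previous output."""
    p = [p[3]] * 4                                  # pshufd: broadcast last value
    c = add_lanes(gaps, shift_left_lanes(gaps, 1))  # (c1, c1+c2, c2+c3, c3+c4)
    c = add_lanes(c, shift_left_lanes(c, 2))        # full prefix sums
    return add_lanes(p, c)


def prefix_sum2(gaps, p):
    """Two gaps in the first two lanes; the last two input lanes are irrelevant."""
    p = [p[3]] * 4
    c = add_lanes(gaps, shift_left_lanes(gaps, 1))  # (c1, c1+c2, *, *)
    c = [c[0], c[1], c[1], c[1]]                    # pshufd: copy lane 2 upward
    return add_lanes(p, c)


first = prefix_sum4([3, 7, 1, 139], [0, 0, 0, 0])  # [3, 10, 11, 150]
second = prefix_sum4([850, 1, 1, 1], first)        # [1000, 1001, 1002, 1003]
```

In the two-gap variant, copying the second lane into the third and fourth keeps the last decoded value in the last lane, so the next call can broadcast it as before.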

V. Experiments 

We implemented our software in C and C++. The benchmark program ran on a Linux server with an Intel i7-4770 processor running at 3.4 GHz. This Haswell processor has 32 kB of L1 cache and 256 kB of L2 cache per core with 8 MB of L3 cache. The machine has 32 GB of RAM (DDR3-1600 with double-channel). We disabled Turbo Boost and set the processor to run at its highest clock speed. We report wall-clock timings. Our software is freely available under an open-source license (http://maskedvbyte.org) and was compiled using the GNU GCC 4.8 compiler with the -O3 flag.

For our experiments, we used a collection of posting lists extracted from the ClueWeb09 (Category B) data set. ClueWeb09 includes 50 million web pages. We have one posting list for each of the 1 million most frequent words, after excluding stop words and applying lemmatization. Documents were sorted lexicographically based on their URL prior to attributing document identifiers. The posting lists are grouped based on length: we store and process lists of lengths $2^K$ to $2^{K+1} - 1$ together for all values of $K$. Coding and decoding times include differential coding. Shorter lists are less compressible than longer lists since their gaps tend to be larger. Our results are summarized in Fig. 2. For each group of posting lists we compute the average bits used per integer after compression: this value ranges from 8 to slightly less than 16. All decoders work on the same compressed data.

When decoding long posting lists to RAM, our speed is limited by RAM throughput. For this reason, we decode the compressed data sequentially to buffers fitting in L1 cache (4096 integers). For each group and each decoder, we compute the average decoding speed in millions of 32-bit integers per second (mis). For our Masked VByte decoder, the speed ranges from 2700 mis for the most compressible lists to 650 mis for the least compressible ones. The speed of the conventional VByte decoder ranges from 1100 mis to 300 mis. For all groups of posting lists in our experiments, the Masked VByte decoder was at least twice as fast as the conventional VByte decoder. However, for some groups, the speedup is between 3× and 4×.

Figure 2: Performance comparison for various sets of posting lists (ClueWeb).

If we fully decode all lists instead of decoding to a buffer that fits in CPU cache, the performance of Masked VByte can be reduced by about 15%. For example, instead of a maximal speed of 2700 mis, Masked VByte is limited to 2300 mis.


VI. Conclusion

To our knowledge, no existing VByte decoder comes close to the speed of Masked VByte. Given that the VByte format is a de facto standard, this suggests that Masked VByte could help optimize a wide range of existing software without affecting the data formats.

Masked VByte is in production code at Indeed as part of the open-source analytics platform Imhotep (http://indeedeng.github.io/imhotep/).